Comparison of chloroplast genomes and phylogenetic analysis of four species in Quercus section Cyclobalanopsis

The identification in Quercus L. species was considered to be difficult all the time. The fundamental phylogenies of Quercus have already been discussed by morphological and molecular means. However, the morphological characteristics of some Quercus groups may not be consistent with the molecular results (such as the group Helferiana), which may lead to blurring of species relationships and prevent further evolutionary researches. To understand the interspecific relationships and phylogenetic positions, we sequenced and assembled the CPGs (160,715 bp-160842 bp) of four Quercus section Cyclobalanopsis species by Illumina pair-end sequencing. The genomic structure, GC content, and IR/SC boundaries exhibited significant conservatism. Six highly variable hotspots were detected in comparison analysis, among which rpoC1, clpP and ycf1 could be used as molecular markers. Besides, two genes (petA, ycf2) were detected to be under positive selection pressure. The phylogenetic analysis showed: Trigonobalanus genus and Fagus genus located at the base of the phylogeny tree; The Quercus genus species were distincted to two clades, including five sections. All Compound Trichome Base species clustered into a single branch, which was in accordance with the results of the morphological studies. But neither of group Gilva nor group Helferiana had formed a monophyly. Six Compound Trichome Base species gathered together in pairs to form three branch respectively (Quercus kerrii and Quercus chungii; Quercus austrocochinchinensis with Quercus gilva; Quercus helferiana and Quercus rex). Due to a low support rate (0.338) in the phylogeny tree, the interspecies relationship between the two branches differentiated by this node remained unclear. We believe that Q. helferiana and Q. kerrii can exist as independent species due to their distance in the phylogeny tree. Our study provided genetic information in Quercus genus, which could be applied to further studies in taxonomy and phylogenetics.


Plant samples, DNA extraction and sequencing
Tender, unwounded leaf of 4 Quercus section Cyclobalanopsis species (Quercus austrocochinchinensis, Quercus kerrii, Quercus helferiana, and Quercus rex) were harvested from three provinces in China: Yunnan, Hainan and Guizhou.Silica gel was used to dry the materials collected.Voucher specimens were saved in China West Normal University (CWNU) and sample information was listed in Table1.The improved CTAB protocol was used to extract and purify total genomic DNA from leaf tissues (6 g per species) 30 .We used the high-quality genomic DNA to constructed a 400 bp Illumina Nova Seq library according to the manufacturer's protocol.Then the sequencing was performed on the Illumina Nova Seq PE150 platform, using pair-end strategies.Quality control on the raw data used FastQC 31 .Use Adapter Removal 32 to leach the joint contamination at the 3' end; quality filtration by sliding window method.Sequencing information was provided in Table 1.

Genome structure and codon usage analyses
In order to understand the framework of whole chloroplast genomes, Geneious was used to identify the size, genes and GC content in CPGs.Then confirmed and visualized the boundaries between IRs/SCs by IRscope 38 .
The totality of codons and RSCU (relative synonymous codon usage) values were calculated by CodonW with default parameters 39 .

Sequence divergence and comparative analyses
The types of long sequence repeats (LSRs) were predicted by REPuter 40 , including type forward (F), type reverse (R), type complementary (C) and type palindromic (P), with parameters setting to: 30 bp for minimum repeat sequence, 3 for Hamming distance.In addition, MISA 41 with parameters setting of ≥ 10 for type mononucleotides, ≥ 5 for type dinucleotides, ≥ 4 for type trinucleotides, and ≥ 3 for type tetranucleotides, pentanucleotides and hexa-nucleotides were applied to predicted SSRs quantity and types.Multiple sequence alignment of CPGs were performed in mVISTA 42 , selecting Shufe-LAGAN mode when analyzing with Quercus gilva (MG678009) as a reference.After alignment the sequence by MAFFT 43 with default parameters, nucleotide diversity (Pi) values of CPGs evaluating were performed using DnaSP 44 .

Selection pressure and phylogenetic analyses
KaKs_Calculator 45 was adopted to calculated the rate of nonsynonymous mutation (Ka), synonymous mutation (Ks) in protein-coding genes.So that the results of Ka/Ks could be used to assesse the role of selection for each gene in CPGs of 11 Quercus species, seven species of which were downloaded from NCBI (Supplementary Table S7).
For the purpose of acquainting with the phylogenetic relationships, phylogenetic tree of Quercus genus were implemented using Bayesian (BI) analysis methods, based on the CPG data.The CPG sequences required for the phylogenetic analysis are shown in the Table S7, including 27 Fagaceae species downloaded from NCBI.Apply all selected CPG sequences to MAFFT 43 to align.Later MrBayes 46 was utilized to carry out the BI tree analysis on the basis of following processes: infer the best-fit nucleotide substitution model (GTR + F + I + G4) by Modeltest 47 and PAUP 48 ; Run 6,000,000 generations in Markov chain Monte Carlo (MCMC) analysis; Sample the trees each 1,000 generations, and ignore the initial 0.25 as burnin fraction.

Characteristics of the CPGs
The length of 4 CPGs assembled scoped from 160,715 bp in Q. kerrii to 160,842 bp in Q. rex.All the structures manifest same circular quadripartite tetrad, comprising of 2 single-copy areas (LSC, SSC) and a couple of inverted repeats (IRs).The length of each region was shown in Table 2.The GC content of general sequences was 36.9% for all samples.Besides, the GC content in IRs lead to 42.8%, which was greater than that in LSC and SSC areas (34.8% and 31.1%).Additionally, all the four CPGs encoded 131 genes, namely 86 CDS, 37 tRNA and eight rRNA, and it should be noted that 18 (seven CDS, seven tRNA and four rRNA) of these were iterant in the IRs.Among all of the genes, 15 have an intron and three genes (rps12, clpP, ycf3) with two.The specific distribution and function of the genes were shown in Fig. 1, Supplementary Table S1.What should be noted was that the 5' end of gene ycf1 standed in the IRa area but the 3' end standed in SSC area, therefore created a 5' end pseudogene (ycf1Ψ) in the IRb in all CPGs compared, resulting in all IRb/SSC (JSB) boundaries lying within the pseudogene ycf1Ψ.
The codon usage analysis summarized in Table 3.According to the results, sequence sizes range of extracted protein-coding genes were 64,359-64,377 bp in four Quercus section Cyclobalanopsis species; 21,453-21,459 codons were encoded.The ENC (Effective Number of Codons) value was between 49.93 and 49.97.The FOP (Frequency of Optimal Codons) value was 0.353 in Q. kerrii and 0.354 in other three samples.The GC content was between 37.93 and 37.95%.The codon preference indexes of the four species varied slightly, indicating that they had similar codon usage.The GC3 of four species ranged between 29.85 and 29.88%, indicating that they prefer codons ending with A/U.
The CDSs of 17 CPGs (four newly sequenced and 13 species of Fagaceae released in NCBI) were extracted using Geneious.Subsequently, based on the extracted sequences, the ratio of RSCU in all samples were calculated and clustered.The results showed in Fig. 3, Supplementary Table S2.We found that: Leucine (Leu) encoded with the maximum number of codons, arranging from 2044 to 2268, with the number of isoleucine (Ile 1699-1892) following.The minimum number of codons (213 to241) presented in Cysteine (Cys).( 2) The (RSCU) values varied marginally among the CDSs of 17 species.31 codons were frequently manipulated since RSCU > 1, and the remaining codons were less frequently used as their RSCU ratios were less than 1. (3) The frequently used codons include: UUA, AGA, UAA(*), GCU, UCU, GAU, ACU, and the codon usage frequency of UAC, CUC, CGC, CUG, AGC, and GAC is on the low side.Thereinto the UUA codon showed a bias in 17 CPGs due to its highest usage.No usage frequency bias (RSCU = 1) showed in the starting codons of AUG and UGG, which encoded methionine (Met) and tryptophan (Try).

Repeated sequences analysis
A total of 163 LSRs were identified among the four CPGs examined.As a whole, the amount of LSRs identified in every CPG was scoping from 37 in Q. rex to 44 LSRs in Q. helferiana.Thereinto, 14-18 were type F, 20-22 were palindromic repeats, and the number of type R was two in Q. rex when other three species were three (Fig. 4A.Supplementary Table S3).Just one complement repeat was filtrated from four species.Among these repeats, the longest repeat was 56 bp in every species, and the most common length of repeats was 30 bp. 44.5% LSRs located in IGS, and 23.5% were found in the intron region.About half repeat sequences (46.8%) were distinguished in the IR areas (Table S3).
The total quantitiy of SSRs identifed in the CPGs of four Quercus section Cyclobalanopsis species was 453, ranging from 110 in Q. austrocochinchinensis to 115 in Q. rex, among which 74-81 were type mono-, 15 were type di-, seven were type tri-, 9-11 were type tetra-, and three were penta-(Fig.4B).The most universal unit of SSRs was A/T (mono-), whose amount ranged from 69 to 76, far higher than in the other types.68% of SSRs were type mononucleotide made up of unit A/T and C/G.What's more, most of the SSRs (70.8%) were located   S4).All the type din-comprised multiple copies of unit AT/TA and AG/CT (Fig. 4C).The type of hexanucleotide was not detected in all species.Taken as a whole, no significant distinction in the number of SSR units among the four species, except the slight diferences in unit of mono-and penta-.

Sequence divergence, hotspots and selection pressure estimation
CPGs Comparative analysis could be seen in Fig. 5, revealing that high sequence similarity among the four sect.Cyclobalanopsis species.Sequences in noncoding areas were more variant than in coding areas generally.Besides, the level of sequence divergence in SCs areas were evidently higher than that in IR areas.We found eight intergenic regions were in a high degree of variation, of which seven were located at LSC areas as follows: psbA/trnH, rps16/trnK, trnQ/rps16, trnE/trnT, rbcL/accD, psbE/petL, ndhF/rpl32.One located at SSC areas, namely ndhI/ ndhG.Other than aforementioned areas, the intron area of rpoC1 showed high level of sequence divergence too.
Window length setting to 600 bp, we calculated the nucleotide diversity values to elucidate levels of diversity for all CPGs assembled in this study.The Pi values were recorded in Supplementary Table S5, ranging from 0 to 0.01083, with 0.00041 on average.When the amount of polymorphic loci outweighed the sum of mean and twofold standard deviation, the region is defined as a highly variable region 49 .Ultimately, six hotspots (Pi > 0.0022) were discovered, coding and nocoding regions each accounting for half.The greatest Pi value (0.01083) appeared in the region between gene trnK-UUU and rps16.The distribution of highly variable regions was shown in Fig. 6.In general, these regions were not located at the IR areas but all at the SC areas, which reflected an identical pattern of CPG structural variation.
To estimate the role of selection of the Quercus section Cyclobalanopsis species, Ka and Ks values of 79 unique CDS were calculated in 11 CPGs using Quercus chenii as a reference.The Ka/Ks values were simply calculated and recorded in the Supplementary Table S6, ranging from 0 to 1.471.Among which 40 protein-coding genes showed significance (Fig. 7) in 11 species.Based on the calculation results, we speculated that the purification selection may affect on most protein coding genes, as their Ka/Ks values were less than 1.At the same time, when Ka/Ks > 1 demonstrated that the positive selection was working on the genes.Therefore we identified two genes were under the positive selection, namely petA gene in Q. aliena, and ycf2 gene in Q. austrocochinchinensis, Q. rex, Q. kerrii, Q. sichourensis, and Q. neglecta.S5.

CPG architectures in four Quercus section Cyclobalanopsis species
Four species CPGs were successfully assembled of Quercus Section Cyclobalanopsis in the present paper.The size of four CPGs (ca.160 kb) conformed to the photosynthetic land plant plastid chromosomes, whose size varied from 120 to 160 kb 50 .The same quadripartite circular structure were found in the four assembled CPGs and other Quercus species [51][52][53] .Overall GC content had no distinction within the four species.After CPGs comparison, it was found that the totality, order, and function of genes were highly conservative in genus Quercus, which were also in accordance with most Fagaceae species 25,54,55 , evidencing a highly conservative CPG construction in Quercus Section Cyclobalanopsis.
Due to the duplicative nature of the IR reduced the substitution rate within this region, therefore it was of great significance to analyze the contraction and expansion of IRs in evolutionary researches 56 .In addition, the IR regions were vital in stabilizing the structure of the CPGs, which were also the main factor affecting the total length 57,58 .The results showed that boundaries of four areas in the CPGs were conserved in six Quercus section Cyclobalanopsis species.The IRs/SCs boundary of all species compared in this study were located within similar positions except for slight difference in JLB, whose displacements from rps19 presented subtle variations in different species.Most of the compared species found no significant expansion or contraction in the IR regions except the Quercus neglecta, which had a only four bp displacement between the JLB and rps19.The conservatism of Quercus section Cyclobalanopsis was demonstrated by the relatively constant length of CPGs and the minor variations in their region borders, as the same conditions with other Quercus species 25,52 .
Codon usage bias was a natural phenomenon caused by mutation, natural selection, genome composition, etc [59][60][61] .In the PCGs of four cp genomes, total 64 codons were detected, encoding 20 amino acids.We could tell from the values of RSCU and content of GC3 that the bias in codon usage towards A/U at the third position, a phenomenon that is widespread in angiosperms [62][63][64][65] .

Large repeats and simple sequence repeats
Dispersed in CPGs, long repeat sequences played an significant role the genomic inheritance, variation and the evolution of species 50,57,66 .Our study identified a total of 163 LSRs with palindromic being the most common type.The variations observed in CPGs could partially attributed to the differences in the number and types of LSRs 67 .Therefore, due to their genetic variations, LSRs can potentially provide valuable information for researches of phylogenetic relationship and population genetics.After analysis, it was found that the repeat sequences of this study were in accord with the general pattern above: about half (43.2-46.3%) of LSRs were identified in IGS, following by the coding regions and introns.Current studies had suggested that most repeats in CPG were situated in the IGS, comparing to the coding regions 15,68 .SSR had been extensively studied as a kind of effective molecular marker in various fields such as discrimination, breeding, conservation and phylogenetic research at the species and population levels [69][70][71] .A strong A/T bias, SC regions concentration (90.9-91.3%),and IGS concentration (69.6-72.7%)were detected in SSRs of four Quercus section Cyclobalanopsis species, similar to other Quercus genus species 29,72 .The numbers and types of SSRs varied slightly in Quercus genus but extensively in other families [73][74][75] .The numbers of SSRs were almost identical between Quercus section Cyclobalanopsis and section Cerris 74 , so we speculated that such case might imply that the two sections were phylogenetically more closely related.

Conservatisms, highly variable regions and selection pressure estimation
We compared the whole sequences of CPG in four species with Quercus gilva as the reference.The results indicated that there were differences in the degree of variation between regions of CPGs, with the single-copy (SC) regions having higher variation than IR regions, simultaneously the IGS regions having higher variation than coding regions.Same phenomena were found in other Quercus species 51,52,76,77 .The copy-dependent repair mechanism of CPGs could guarantee the stability of IRs construction and thereby advance the steadiness and conservation of genomes, which possibly explain the different degree of variation between IRs and SCs.In addition, due to natural selection, the coding areas tend to exhibit higher conservation than the noncoding areas [78][79][80] .The gene regions of high variability we found (namely rpoC1, clpP and ycf1) in both sequence divergence analysis and nucleotide variability (pi) assessment could be used to develop DNA barcodes, conduct species identification and systematic classification 81 .Out of the highly variable regions identified, the ycf1 gene 82 and two IGS regions: trnH-psbA, trnK-rps16 had already been selected as practicable barcode for plants [83][84][85] .
In our study, most of the Ka/Ks values were less than 1 or not available, suggesting that the emergence frequency of synonymous nucleotide substitution was more than that of non-synonymousnucleotide substitution due to the purify selection process 86,87 .We conjectured that positive selection was operating only in two genes: petA in Q. aliena and the ycf1 in multiple Quercus taxa.The ycf1 was indicated to contain multiple SSRs in many taxa and it was claimed that these SSRs were undoubted in detecting population-level polymorphisms and could also be used to compare phylogenetic relationships at the genus level or higher taxonomic levels 19,73 .Whether these divergence hotspots found in the above analysis could be utilized for DNA barcodes or estimating taxonomic evolution in genus Quercus needs more further researches.

Inference of phylogenetic relationship
Due to the complex evolutionary issues such as convergent evolution, extensive hybridization, and serious hybridization introgression in the Quercus genus, great challenges remain in the phylogenetic relationship research of www.nature.com/scientificreports/Oak trees 1,88,89 .CPGs have been demonstrated considerable utility in addressing the phylogeny relationships of angiosperms 90 .The phylogenetic trees we reconstructed based on CPGs indicated two major clades corresponding to geographic distribution: sections of Quercus and Lobatae constituted a "New World Clade" (subgenus Quercus), while the sections of Cyclobalanopsis, Cerris and Ilex forming an "Old World Clade" (subgenus Cerris) 9,89,91 .The section Ilex was paraphyletic, nested into the lineage formed by section Cerris, which was similar to the results based on plastid genome but differed from the phylogenetic relationships inferred from RAD-seq data 29,88,92 .
The morphological studies found that the four species we studied possessed compound trichome base (CTB) so that clustered into a single branch with other CTB species, distincting to the group STB (Single-celled Trichome Base) 5,93 , similar to the results of our phylogenetic study based on the CPGs.In the CTB group, Q. kerrii and Q. chungii clustered into a clade that diverging before the other four species, which had simple uniseriate thin-walled trichomes, distinct from other CTB species 5 .Q. austrocochinchinensis then clustered with Q. gilva into sister groups, which differed from the clustering results of RAD-seq data 4,94 .Q. helferiana and Q. rex gathered together, they all possessed Fasciculation trichomes 5 .Deng 3 divided the CTB species into two groups based on their comprehensive morphological characteristics, namely group Gilva (containing Q. chungii and Q Gilva) and group Helferiana, including the four species we studied.From our results, we can see that neither of these two groups had formed a monophyly, and there were mosaics between these species.Q. helferiana and Q. kerrii were far apart in the phylogeny tree, so we believe that they can exist as independent species.Nevertheless the interspecific relationship within the four species remained some controversies: for instance, the Q. kerrii and Q. austrocochinchinensis gathered for a monophyletic sister branch in multiple studies, different from our BI tree.The Q. rex was thought to be the base of 4 species, but in our analysis the Q. kerrii differentiated firstly 4,5 .Due to the presence of one node with a low support rate (0.338) in the phylogeny tree, the interspecies relationship between the two branches differentiated by this node was still unclear.The continued advancements of sequencing techniques will allow for the inclusion of more taxa and samples in future studies, facilitating further exploration of the interspecific relationships and phylogenomics of the Quercus section Cyclobalanopsis.

Conclusions
Chloroplast DNA has the characteristics of conservation and uniparental inheritance, which is of great significance for study in genetic diversity, population structure, and evolutionary relationships.In the present study, we completed the CPG basic analyses of four species in Quercus section Cyclobalanopsis and compared them with the genomes of other oak trees.Despite the overall conservation of CPG structure and gene content were obviously found, distinct sequence divergences were uncovered in alternating regions of these genomes among the studied species.The findings provide three genes including rpoC1, clpP and ycf1 as DNA barcode for future studies of species identification and systematic classification.The phylogenetic analysis based on CPG data suggested: all Quercus species were divided into two categories, and consistent with the groups divided by morphology (STB and CTB); Q. helferiana and Q. kerrii were far apart in the phylogeny tree, so we believe that they can exist as independent species.In addition, the four species we studied, along with Q. chungii and Q. gilva, clustered into one branch with a bootstrap support rate of 1.Therefore, these six species exhibit close phylogenetic relationships in both morphology and molecular aspects, and should be classified into a group.In a word, the findings obtained will facilitate further investigations into the taxonomy, phylogenetic evolution and preservation of Quercus genus.

Figure 2
Figure 2 gave the results of CPGs boundary comparison in six Quercus section Cyclobalanopsis species, which could show the borderlines of the IRs and SCs regions in CPGs.The junction of LSC and IRb (JLB) laid in IGS (intergenic spacer) of rps19 and rpl2 gene.Most samples had 11 bp shift away from the borderline for rps19 gene in JLB, except Q. helferiana and Q. neglecta, which had 13 bp and four bp shift respectively.Moreover, the demarcation of LSC and IRa was situated in the IGS of rpl2 and trnH gene, with the trnH gene shifting 15 or 16 bp from JLA. IRa/SSC boundary (JSA) was reposed within gene ycf1.What should be noted was that the 5' end of gene ycf1 standed in the IRa area but the 3' end standed in SSC area, therefore created a 5' end pseudogene (ycf1Ψ) in the IRb in all CPGs compared, resulting in all IRb/SSC (JSB) boundaries lying within the pseudogene ycf1Ψ.The codon usage analysis summarized in Table3.According to the results, sequence sizes range of extracted protein-coding genes were 64,359-64,377 bp in four Quercus section Cyclobalanopsis species; 21,453-21,459 codons were encoded.The ENC (Effective Number of Codons) value was between 49.93 and 49.97.The FOP (Frequency of Optimal Codons) value was 0.353 in Q. kerrii and 0.354 in other three samples.The GC content was between 37.93 and 37.95%.The codon preference indexes of the four species varied slightly, indicating that they had similar codon usage.The GC3 of four species ranged between 29.85 and 29.88%, indicating that they prefer codons ending with A/U.The CDSs of 17 CPGs (four newly sequenced and 13 species of Fagaceae released in NCBI) were extracted using Geneious.Subsequently, based on the extracted sequences, the ratio of RSCU in all samples were calculated and clustered.The results showed in Fig.3, Supplementary TableS2.We found that:Leucine (Leu) encoded with the maximum number of codons, arranging from 2044 to 2268, with the number of isoleucine (Ile 1699-1892) following.The minimum number of codons (213 to241) presented in Cysteine

Figure 2 .
Figure 2. CPGs boundary comparison in six Quercus section Cyclobalanopsis species.Genes shown above the lines were transfered in reverse and those displayed below were transfered forward.

Figure 3 .
Figure 3. RSCU ratios of CDS genes for CPGs in four species and its sibling species of Fagaceae.(*) indicated the stop codon.The red font means the four species we studied.

Figure 4 .
Figure 4. Distribution of repeats in four samples of Quercus section Cyclobalanopsis.Part A-type and number of LSRs.Part B-distribution of SSRs types.Part C-the number of SSR units detected.

Figure 5 .
Figure 5. Sequence alignment of the CPGs of four Quercus section Cyclobalanopsis species.The Quercus.Gilva (MG678009) was used as reference.

Figure 6 .
Figure 6.Pi values in the multiple alignments of 4 CPGs, details in Supplementary TableS5.

Table 1 .
Basic information of 4 Quercus section Cyclobalanopsis species.

Table 2 .
A summary of the statistics for the CPGs of 4 Quercus sect.Cyclobalanopsis species.

Table 3 .
Codon preference index of four species of Quercus section Cyclobalanopsis.

Table S6
Trichome Base) and CTB (Compound Trichome Base).All the CTB species were clustered into a single branch including the four species we studied.